TITLE OF THE INVENTION

PARALLEL COMPUTER HAVING A HIERARCHY STRUCTURE

CROSS-REFERENCE TO RELATED APPLICATION

This application is based upon and claims the benefit of
priority from the prior Japanese Patent Application No. 11-297439,
filed October 19, 1999; the entire contents of which are
incorporated herein by reference.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a parallel computer having
a hierarchy structure, and more particularly, to a parallel
computer well suited to image processing that requires an
enormous amount of calculation, computer entertainment, and
the execution of scientific calculations.

2. Description of the Related Art

In conventional parallel computers, for instance a
conventional parallel computer having a common bus structure
(or a common bus system), a plurality of processors implemented
with a plurality of semiconductor chips are arranged along
a common bus formed on a semiconductor substrate. In this
configuration, in order to further reduce the traffic on the
common bus, a cache memory is incorporated in each layer when
the common bus is formed in a hierarchy structure.

In general, a multiprocessing computer system includes
two or more processors that execute computing tasks. In this
system, while one processor executes a dedicated computing task,
other processors execute other computing tasks that are
independent of that dedicated computing task; alternatively, the
multiprocessing computer system divides a specified computing
task into plural execution elements, and the plurality of
processors in the multiprocessing computer system then execute
these elements in order to reduce the total execution time of
the computing task. In general, a processor is a device that
operates on one or more operands and generates and outputs the
execution result. That is, an arithmetic operation is
performed according to instructions executed by the processor.

A general structure of an available multiprocessing
computer system is the symmetric multiprocessor (SMP) structure.
In a typical example, the multiprocessing computer system of
the SMP structure incorporates plural processors that are
connected to a common bus through a cache hierarchy structure.
In addition, a common memory that is used by the processors
in this system is also connected to the common bus. An access
to a specified memory location in the common memory takes
the same time as an access to any other memory location. Because
each memory location in the common memory is accessed uniformly,
the structure of the common memory is referred to as a uniform
memory architecture (UMA).

In many cases, the processors and an internal cache are
incorporated in a computer system. In an SMP computer system,
one or more cache memories are formed between a processor and
the common bus in a cache hierarchy. The computer system having
the common bus structure operates based on cache coherency in
order to maintain the common memory model in which a specific
address precisely indicates a data item at any time.

In general, when the result of an arithmetic operation on
data stored in a memory field corresponding to a specific memory
address has been copied to a cache memory in a cache layer, the
arithmetic operation is in a coherent state. For example, when
a data item stored in a memory field addressed by a specific
address is updated, the updated data item is copied to the
cache memory that has stored the previous data item. Alternatively,
the previous data item is nullified in one stage and the updated
data item is transferred from the main memory in a following
stage. In the common bus system, a snooping bus protocol is
commonly used. Each coherent transaction executed on the common
bus is snooped (or detected) and compared with the data items in
the cache memory. When a copied data item that is affected by
the execution of the coherent transaction is detected,
the cache line containing the copied data item is updated
according to the coherent transaction.
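
The snooping behavior described above can be illustrated with a
minimal sketch. The names Bus, SnoopingCache, read, write and snoop
are hypothetical, and the model assumes a write-through cache that
nullifies (invalidates) its stale copy, which is one of the two
options mentioned above. It is an illustration only, not a
description of any particular bus protocol.

```python
class Bus:
    """Hypothetical common bus that broadcasts every write to all attached caches."""

    def __init__(self):
        self.memory = {}   # shared main memory: address -> value
        self.caches = []   # caches snooping this bus

    def broadcast_write(self, source, address, value):
        self.memory[address] = value
        for cache in self.caches:
            if cache is not source:
                cache.snoop(address)


class SnoopingCache:
    """Write-through cache that invalidates its copy when it snoops a write."""

    def __init__(self, bus):
        self.bus = bus
        self.lines = {}    # address -> cached value
        bus.caches.append(self)

    def read(self, address):
        if address not in self.lines:          # miss: fetch from main memory
            self.lines[address] = self.bus.memory[address]
        return self.lines[address]

    def write(self, address, value):
        self.lines[address] = value
        self.bus.broadcast_write(self, address, value)

    def snoop(self, address):
        # Nullify the stale copy; the next read reloads it from main memory.
        self.lines.pop(address, None)


bus = Bus()
bus.memory[0x10] = 1
a, b = SnoopingCache(bus), SnoopingCache(bus)
b.read(0x10)          # b now holds a copy of address 0x10
a.write(0x10, 2)      # b snoops the write and drops its copy
print(b.read(0x10))   # b reloads the updated value from main memory
```

Every write in this sketch crosses the bus, which is why bus
traffic grows with the number of processors.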

The common bus structure, however, has several drawbacks
that limit the capability of the multiprocessing computer system.
That is, a bus has a peak bandwidth (namely, the number of bytes
per second that can be transferred on the bus).
When additional processors are connected to the common bus, the
bandwidth required for transferring data and instructions to the
processors may exceed this peak bandwidth. When the bandwidth
required by the processors exceeds the available bus bandwidth,
some processors enter a waiting state until bus bandwidth
becomes available. This reduces the performance of the computer
system. In general, the maximum number of processors that can be
connected to the common bus is approximately 32. In addition, when
plural processors are connected to the common bus, the capacitive
load of the bus increases and the physical length of the common
bus also increases. When the capacitive load and the length
of the bus increase, the delay of signal transfer on
the bus also increases. The increased delay of signal transfer
in turn increases the execution time of a transaction.
Accordingly, as processors are added to the common bus, the
peak bandwidth of the bus also decreases.

The drawbacks described above become more severe as the
performance and the operating frequency of the processors
increase.

The micro-architecture of processors improved to meet the
demand for high frequency requires a higher bandwidth than
processors of a previous generation, even
if the same number of processors is connected to a bus. Accordingly,
a bus having adequate bandwidth for a multiprocessing
computer system of a previous generation cannot satisfy the
demand of a current computer system including high-performance
processors. Further, for multiprocessing systems other than those
having the common bus structure, there is a drawback that it
becomes difficult to construct a programming model and to
debug.

There is therefore a need for a new multiprocessing system
architecture that is capable of operating
processors in parallel even if the performance of the
microprocessor and its peripheral circuits is increased and
even if the number of processors connected to a bus is
increased.

SUMMARY OF THE INVENTION

Accordingly, an object of the present invention is, with
due consideration to the drawbacks of the conventional technique,
to provide a parallel computer having a hierarchy structure
capable of operating in parallel a desired number of high-speed
processors made with leading-edge technology.

In accordance with a preferred embodiment of the present
invention, a parallel computer having a hierarchy structure
comprises an upper processing unit for executing a parallel
processing task in parallel, and a plurality of lower processing
units connected to the upper processing unit through a connection
line. In the parallel computer, the upper processing unit divides
the parallel processing task into a plurality of subtasks,
assigns the plurality of subtasks to the corresponding lower
processing units, and transfers the data required for executing
the plurality of subtasks to the lower processing units. The
lower processing units execute the corresponding subtasks from
the upper processing unit, and notify the upper processing
unit of the completion of the execution of the corresponding
subtasks when the execution of the subtasks is completed. The
upper processing unit completes the parallel processing task
when it has received the notification of the completion of the
execution from all of the lower processing units.
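
The divide, assign, and complete sequence above can be sketched
in software as follows. This is only an illustrative model under
stated assumptions: the function names (split_task, lower_unit,
upper_unit), the contiguous splitting policy, the sum-of-squares
workload, and the use of a thread pool in place of hardware
processing units are all hypothetical and are not taken from the
specification.

```python
from concurrent.futures import ThreadPoolExecutor


def split_task(data, n_units):
    """Divide the input into at most n_units contiguous subtasks (hypothetical policy)."""
    size = max(1, -(-len(data) // n_units))  # ceiling division, at least 1
    return [data[i:i + size] for i in range(0, len(data), size)]


def lower_unit(subtask):
    """A lower processing unit: executes its subtask and reports the result."""
    return sum(x * x for x in subtask)


def upper_unit(data, n_units=4):
    """Upper processing unit: divide, assign, then wait for every completion report."""
    subtasks = split_task(data, n_units)
    with ThreadPoolExecutor(max_workers=n_units) as pool:
        # Collecting the results blocks until every lower unit has finished.
        partial_results = list(pool.map(lower_unit, subtasks))
    return sum(partial_results)


print(upper_unit(list(range(10))))  # sum of squares of 0..9
```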

In accordance with a preferred embodiment of the present
invention, a parallel computer having a hierarchy structure
comprises an upper processing unit for executing a parallel
processing task in parallel, a plurality of intermediate
processing units connected to the upper processing unit through
a first connection line, and a plurality of lower processing
units connected to the intermediate processing units through
a second connection line. In the parallel computer, the upper
processing unit divides the parallel processing task into a
plurality of first subtasks, assigns the plurality of first
subtasks to the corresponding intermediate processing units,
and transfers the data required for executing the plurality
of first subtasks to the intermediate processing units. The
intermediate processing units divide the first subtasks into a
plurality of second subtasks, assign the plurality of second
subtasks to the corresponding lower processing units, and
transfer the data required for executing the plurality of
second subtasks to the lower processing units. The lower
processing units execute the corresponding second subtasks, and
notify the corresponding intermediate processing units of the
completion of the execution of the second subtasks when the
execution of all of the second subtasks is completed. The
intermediate processing units notify the upper processing unit
of the completion of the execution of the corresponding second
subtasks when the execution of all of the first subtasks
is completed. The upper processing unit completes the parallel
processing task when it has received the notification of the
completion of the execution from all of the intermediate
processing units.
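
The two-stage division above can be sketched as nested function
calls. This models only the flow of subtasks and completion
results between stages, not concurrent execution. The names
(divide, lower_unit, intermediate_unit, upper_unit), the
round-robin splitting, and the summation workload are
hypothetical choices made for illustration.

```python
def divide(items, parts):
    """Split items into `parts` round-robin groups, dropping empty groups."""
    groups = [items[i::parts] for i in range(parts)]
    return [g for g in groups if g]


def lower_unit(second_subtask):
    """Lowest stage: performs the actual computation."""
    return sum(second_subtask)


def intermediate_unit(first_subtask, n_lower):
    """Divide a first subtask into second subtasks and collect every completion."""
    second_subtasks = divide(first_subtask, n_lower)
    results = [lower_unit(s) for s in second_subtasks]
    return sum(results)   # reported upward only after all lower units finish


def upper_unit(task, n_intermediate=4, n_lower=4):
    """Divide the task into first subtasks and collect every completion."""
    first_subtasks = divide(task, n_intermediate)
    results = [intermediate_unit(f, n_lower) for f in first_subtasks]
    return sum(results)


print(upper_unit(list(range(100))))  # same total as sum(range(100))
```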

In the parallel computer described above, the lower
processing units connected to the connection line are mounted
on a smaller area than the upper processing unit,
a signal line through which each lower processing unit is
connected has a smaller wiring capacitance, and the operating
frequency of the lower processing units is higher than that
of the upper processing unit.

In the parallel computer described above, the lower
processing units connected to the second connection line are
mounted on a smaller area than the intermediate
processing units connected to the first connection line,
a signal line through which each lower processing unit is
connected has a smaller wiring capacitance, and the operating
frequency of the lower processing units is higher than that
of the intermediate processing units.

In the parallel computer described above, each of the upper
processing unit and the lower processing units has a processor
and a memory connected to the processor.

In the parallel computer described above, each of the upper
processing unit, the intermediate processing units, and the lower
processing units has a processor and a memory connected to the
processor.

In the parallel computer described above, the upper
processing unit receives information regarding the completion
of the subtask from each lower processing unit through a status
signal line.

In the parallel computer described above, each
intermediate processing unit and the upper processing unit
receive information regarding the completion of the second
subtask and the first subtask from each lower processing unit
and each intermediate processing unit through a status signal
line, respectively.

In the parallel computer described above, each lower
processing unit comprises a processor, and a memory and a DMA
controller connected to the processor.

In the parallel computer described above, each
intermediate processing unit comprises a processor, and a memory
and a DMA controller connected to the processor.

In the parallel computer described above, the processor
and the DMA controller are connected in a coprocessor connection.

In the parallel computer described above, the upper
processing unit compresses the data required for executing
the subtasks, and then transfers the compressed data to the
corresponding lower processing units.

In the parallel computer described above, the upper
processing unit compresses the data required for executing
the first subtasks, and then transfers the compressed data to
the corresponding intermediate processing units.

In the parallel computer described above, each
intermediate processing unit compresses the data required
for executing the second subtasks, and then transfers the
compressed data to the corresponding lower processing units.

In the parallel computer described above, each
intermediate processing unit is a DMA transfer processing unit.

In the parallel computer described above, the DMA transfer
processing unit is programmable.

In the parallel computer described above, each lower
processing unit is mounted with the upper processing unit as
a multi-chip module on a board.

In the parallel computer described above, each
intermediate processing unit and the corresponding lower
processing units are mounted with the upper processing unit as
a multi-chip module on a board.

In the parallel computer described above, each of the
upper processing unit and the lower processing units is formed
on an independent semiconductor chip, and each semiconductor
chip is mounted as a single multi-chip module.

In the parallel computer described above, each of the
intermediate processing units, the corresponding lower
processing units, and the upper processing unit is formed on
an independent semiconductor chip, and each semiconductor chip
is mounted as a single multi-chip module.

In the parallel computer described above, a structure of
each of the connection line, the first connection line, and the
second connection line is a common bus connection.

In the parallel computer described above, a structure of
each of the connection line, the first connection line, and the
second connection line is a cross-bus connection.

In the parallel computer described above, a structure of
each of the connection line, the first connection line, and the
second connection line is a star connection.

In the preferred embodiment of the present invention, the
processing unit in the intermediate stage of the multiprocessor
system of a hierarchy structure comprises a processor having
the same function as a normal processor, an instruction memory,
and a data memory. The processing unit in the intermediate stage
receives a status signal from the lower processing unit, and
a DMA controller (having a data transfer memory of a large
size) compresses received data and decompresses transfer data
to be transferred, and performs a programmable load dispersion
or a load dispersion according to the operating state.

BRIEF DESCRIPTION OF THE DRAWINGS

These and other objects, features, aspects and advantages
of the present invention will become more apparent from the
following detailed description of the present invention when
taken in conjunction with the accompanying drawings, in which:

FIG. 1 is a block diagram showing an overview of a
multiprocessor system having a hierarchy bus structure according
to a first embodiment as the parallel computer having a hierarchy
structure of the present invention;

FIG. 2 is a block diagram showing an arrangement of multi-chip
modules on a board on which the parallel computer shown
in FIG. 1 is mounted;

FIG. 3 is a block diagram showing an overview of a
multiprocessor system having a hierarchy bus structure as the
parallel computer having a hierarchy structure according to a
second embodiment of the present invention;

FIG. 4 is a block diagram showing a configuration of a multi-chip
module in which the parallel computer is implemented;

FIG. 5 is a block diagram showing one configuration of an
intermediate hierarchy unit;

FIG. 6 is a diagram explaining a collision decision among
objects shown in an image;

FIG. 7 is a diagram explaining a collision decision among
objects shown in an image;

FIG. 8 is a diagram explaining a collision decision among
objects shown in an image;

FIG. 9 is a diagram showing the comparison in performance
between multiprocessor systems of a hierarchy bus structure of
both a prior art and the present invention; and

FIG. 10A and FIG. 10B are diagrams each showing a connection
structure of the parallel computer having a hierarchy structure
according to the present invention.

DESCRIPTION OF THE PREFERRED EMBODIMENTS

Other features of this invention will become apparent
through the following description of preferred embodiments, which
are given for illustration of the invention and are not intended
to be limiting thereof.

First embodiment

FIG. 1 is a block diagram showing an overview of a
multiprocessor system having a hierarchy bus structure according
to a first embodiment as the parallel computer having a hierarchy
structure of the present invention.

The multiprocessor system having a hierarchy bus structure
shown in FIG. 1 comprises a GHQ main memory 111 of 1 Gbyte, a GHQ
processor 113, and four SQUAD processing units 120, each of which
incorporates a plurality of processors (described later in
detail). Each SQUAD processing unit 120 is implemented
as a multi-chip module (MCM). The GHQ processor 113, the four
SQUAD processing units 120, and the GHQ main memory 111 are
connected through a first level bus 112.

The six component units, namely, a memory module forming
the GHQ main memory 111, the GHQ processor 113, and the four
MCMs, are connected to each other on a printed wiring board 101.
As shown in FIG. 2, each of the four SQUAD processing units 120
is mounted as an MCM on the printed wiring board 101.

FIG. 2 is a block diagram showing an arrangement of MCMs
on the printed wiring board 101 on which the parallel computer
having a hierarchy structure according to the first embodiment
shown in FIG. 1 is mounted. In general, an MCM is formed from a
plurality of unpackaged semiconductor integrated circuits that
are incorporated, as a subsystem, in a package like that of a
normal single semiconductor chip.

One type of MCM comprises a substrate (or a board), a thin
film connector structure, and a plurality of integrated circuits
which are connected to the thin film connector structure and
surrounded by an epoxy passivation material. The MCM structure
offers users higher-frequency performance than
a printed wiring board formed by
conventional plated through-hole and surface mounting
technology. That is, as shown in FIG. 2, it is possible to reduce
both the wiring capacitance and the transfer length by packaging
the chips 121 and 123 and the four modules 130 into the multi-chip
module (MCM) 120. In general, this configuration increases the
performance of the computer system.

The MCM requires a high-density wiring structure in order
to transfer signals among the IC chips 101a to 101f mounted on a
common substrate formed by the plural layers 102A to 102E.
Any number of layers may be used, according to the fabrication
technology adopted and the wiring density required by the
design.

As shown in FIG. 2, the IC chips 101c and 101d correspond
to the DMA memory module 121 as one chip and the SQUAD processor
123, respectively. The IC chips 101a, 101b, 101e, and 101f
correspond to the FLIGHT processing units 130, respectively.
The common bus is formed on each of the plural layers 102A to
102E.

The multilevel ceramic substrate technology described in the
prior art document Japanese patent laid-open
publication number JP-A-10/56036 is used for the configuration
of the first embodiment shown in FIG. 1. It is, however, possible
to use any other technology of an equivalent level.

In FIG. 2, each of the layers 102A to 102E is formed by
using an insulating ceramic material on which a patterned
metallization layer has been formed. A part of each of the layers
102A to 102D has been removed, so that a multi-cavity structure
is formed. A part of each patterned metallization layer
is exposed around the multi-cavity portion.

The exposed part in the layer 102E forms a mounting section
for chips. The exposed part is covered by a metallized ground
surface on which the IC chips 101a to 101f are mounted by a chip
mounting technique using a conductive epoxy, solder, or
the like.

Each of the layers 102B to 102D has signal wirings through
which digital data signals are transferred from the IC chips
101a to 101f to MCM input/output pins or terminals (not shown).

The layer 102A provides chemical,
mechanical, and electrical protection for the layers formed
beneath it. In addition, a package
cap is mounted on the layer 102A.

Printed wirings, I/O pins, and terminals are formed on
the layers 102B to 102D by using available MCM technology, so
that the MCM 100 can be connected to external circuits. In wire
bonding, bonding pads at one edge of each of the IC chips 101a
to 101f are connected to selected conductors or bonding pads
of the layers 102B to 102D. The configuration described above
can make the bandwidth of the second level in the lower stage
larger than the bandwidth of the printed wiring board
in the upper stage.

Similarly, a plurality of FIGHTER processing units are
mounted in the FLIGHT processing unit 130, where the plural FIGHTER
processing units are connected on a single silicon substrate,
which provides faster signal transfer than the MCM
structure. It is therefore possible to achieve a wider bandwidth.

Thus, a feature of the present invention is that
the processing units in a lower stage are more highly integrated
and may operate at a higher frequency.

The GHQ processing unit 110 at the uppermost stage monitors
the entire operation of the parallel computer system. The GHQ
processing unit 110 comprises the one-chip GHQ processor 113
and the GHQ main memory 111. In the configuration shown in FIG. 1,
the number of stages is four, namely, the GHQ processing
unit 110, the SQUAD processing units 120, the FLIGHT processing
units 130, and the FIGHTER processing units 140. The GHQ
processing unit 110 is directly connected to the four SQUAD
processing units 120, each of which comprises FLIGHT processing
units 130 and FIGHTER processing units 140. The GHQ processing unit
110, the SQUAD processing units 120, and the GHQ main memory 111
are connected to each other through the first level bus 112 (as
a common bus) of 32-bit width, and the entire bandwidth is 256
Mbytes/sec (frequency 66 MHz).

The SQUAD commander processor 123 in each SQUAD processing
unit 120 controls the entire operation of the unit 120. The SQUAD
commander processor 123 is connected to the SQUAD instruction
memory 125, the SQUAD data memory 127, and the SQUAD DMA memory
121. The SQUAD processing unit 120 is integrated on a single
semiconductor chip, as shown in FIG. 2.

The SQUAD commander processor 123 is directly connected
to the four FLIGHT processing units 130 as the following stage.
Each of the four FLIGHT processing units 130 controls the entire
operation of sixteen FIGHTER processing units 140.

The SQUAD commander processor 123 is connected to the
FLIGHT processing units 130 through the second level bus 114 of
64-bit width. Accordingly, the entire bandwidth becomes 800
Mbytes/sec (frequency 100 MHz).

The FLIGHT commander processor 133 in each FLIGHT
processing unit 130 controls the entire operation of each unit
130. The FLIGHT commander processor 133 is connected to the FLIGHT
instruction memory 135, the FLIGHT data memory 137, and the FLIGHT
DMA memory 131. The FLIGHT processing unit 130 is integrated
on the single semiconductor chip of the SQUAD processing unit
120, as shown in FIG. 2.

The FLIGHT commander processor 133 is directly connected
to the sixteen FIGHTER processors 143 in the FIGHTER processing
units 140 as the following stage, each of which includes a FIGHTER
memory 141. The FLIGHT commander processor 133 in each FLIGHT
processing unit 130 is connected to the FIGHTER processors 143
through the bus 118 of 128-bit width. Accordingly, the entire
bandwidth becomes 2128 Mbytes/sec (frequency 133 MHz). The
operating frequency of the FIGHTER processor 143 is 533 MHz.
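
The bandwidth figures quoted for the three bus levels can be
checked with a short calculation. The formula below (bus width
in bytes multiplied by clock frequency, one transfer per clock
cycle) is an assumption about how the figures were derived, and
the function name is hypothetical. Under that assumption the
sketch reproduces the stated 800 and 2128 Mbytes/sec, while the
first level bus evaluates to 264 Mbytes/sec, slightly above the
stated 256 Mbytes/sec; the stated figure may be rounded or may
follow a different convention.

```python
def peak_bandwidth_mbytes(bus_width_bits, clock_mhz):
    """Peak bandwidth in Mbytes/sec, assuming one transfer per clock cycle."""
    return bus_width_bits // 8 * clock_mhz


# (bus width in bits, clock in MHz) as given in the description
buses = {
    "first level bus 112": (32, 66),
    "second level bus 114": (64, 100),
    "bus 118": (128, 133),
}
for name, (width, mhz) in buses.items():
    print(name, peak_bandwidth_mbytes(width, mhz), "Mbytes/sec")
```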

The GHQ processing unit 110 divides a program (or a task)
into a plurality of sub-programs (or into a plurality of subtasks)
and sends the divided sub-programs to each of the SQUAD processing
units 120. After the division process, the GHQ processing unit
110 compresses the sub-programs (or subtasks) and then transfers
the compressed sub-programs to the SQUAD processing units 120.
The run-length method and the Huffman code method are examples of
the compression algorithm. The compression method is selected
according to the characteristics of the data to be compressed. If
it is not necessary to use any data compression, the subtasks are
transferred uncompressed to the SQUAD processing units 120.
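
As an illustration of the run-length method named above, the
following minimal sketch encodes a byte string as a list of
(value, run length) pairs and decodes it again. The function
names and the pair representation are hypothetical; the
description does not specify the encoding format actually used.

```python
def rle_encode(data):
    """Run-length encode bytes into a list of (value, run_length) pairs."""
    runs = []
    for value in data:
        if runs and runs[-1][0] == value:
            runs[-1] = (value, runs[-1][1] + 1)   # extend the current run
        else:
            runs.append((value, 1))               # start a new run
    return runs


def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original bytes."""
    return b"".join(bytes([value]) * count for value, count in runs)


payload = b"\x00" * 6 + b"\x07\x07" + b"\x01"
encoded = rle_encode(payload)
print(encoded)
assert rle_decode(encoded) == payload
```

Run-length coding helps only when the data contains long runs of
repeated values, which is consistent with selecting the method
according to the characteristics of the data.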

In the parallel computer having a hierarchy structure
according to the present invention, the task is divided into
a plurality of subtasks, the divided subtasks are compressed
if necessary, and the subtasks are then
transferred to the following stage. Therefore the size of the
subtask becomes smaller at the processing units in a lower stage,
and the increase in the required bandwidth can be suppressed
even if the operating frequency becomes high.

When receiving the task data (or compressed task data, if
necessary) from the GHQ processor 113 in the GHQ processing unit
110, the SQUAD commander processor 123 in the SQUAD processing
unit 120 notifies the GHQ processing unit 110
that the status of the SQUAD processing unit 120 has entered a
busy state. Then, when the received task has been compressed, the
SQUAD commander processor 123 decompresses the received task
data.

The SQUAD commander processor 123 in the
SQUAD processing unit 120 then further divides the received task
in order to assign the divided tasks (or subtasks) to the FLIGHT
processing units 130. After completing the division of
the task into subtasks, the SQUAD processing unit 120
compresses the divided tasks and then transfers them to the FLIGHT
processing units 130. If it is inappropriate or unnecessary to
divide the task, the undivided task is transferred
to the FLIGHT processing units 130. When receiving the task data
(or compressed task data, if necessary) from the SQUAD processing
unit 120, the FLIGHT processing unit 130 sends to the SQUAD
processing unit 120 a request to set the status of the FLIGHT
processing unit 130 to the busy state. Then, when the received
task has been compressed, the FLIGHT processing unit 130
decompresses the received task data.

The FLIGHT processing unit 130 further divides the
received task into a plurality of tasks and then transfers the
divided task data items to the FIGHTER processing units 140. Here,
the task data means the content of the processing and the data
necessary for it. That is, the main function of both the SQUAD
processing unit 120 and the FLIGHT processing unit 130, as
intermediate nodes, is scheduling and data transfer. The FIGHTER
processing units 140 at the lowermost stage perform the actual
processing of the task. When receiving the task data, the FIGHTER
processing unit 140 sends to the FLIGHT processing unit 130 in the
upper stage a request to set the status of the corresponding
FIGHTER processing unit 140 to the busy state, and then the FIGHTER
processing unit 140 processes the received task data. After
completing the processing of the task data, the FIGHTER processing
unit 140 transfers the operation result to the FLIGHT commander
processor 133 in the FLIGHT processing unit 130, and then the
status of the FIGHTER processing unit 140 is set to the idle
state.

When detecting a FIGHTER processing unit 140 in the
idle state, the FLIGHT processing unit 130 assigns task
data that has not yet been processed to that FIGHTER processing
unit 140. When all of the task data items divided
by one FLIGHT processing unit 130 have been processed by the
FIGHTER processing units 140, the FLIGHT processing unit 130
transfers the operation result to the SQUAD processing unit 120,
and then this SQUAD processing unit 120 changes the status of the
FLIGHT processing unit 130 from the busy state to the idle state.

Like the operation of the FLIGHT processing unit 130, when
detecting a FLIGHT processing unit 130 in the idle state, the
SQUAD processing unit 120 assigns an unprocessed task to this
FLIGHT processing unit 130. Similarly, when it has received the
operation results from all of the FLIGHT processing units 130
in the lower stage, the SQUAD processing unit 120 sends the
operation result to the GHQ processing unit 110 in the uppermost
stage. The GHQ processing unit 110 then sets the status of
the SQUAD processing unit 120 to the idle state.

That is, like the operation of the FLIGHT processing unit
130, when detecting a SQUAD processing unit 120 in the idle state
while an unprocessed task remains, the GHQ processing unit
110 assigns the unprocessed task to that SQUAD processing unit
120. When the SQUAD processing units 120 have completed the
operation of all of the tasks from the GHQ processing unit 110,
the operation of the given program is completed.
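
The busy/idle status handling described above amounts to dynamic
assignment of unprocessed tasks to whichever unit is idle. The
following sketch simulates one commander and its subordinate
units in discrete time steps. The function name, the tick-based
timing model, and the task representation are hypothetical and
serve only to illustrate the scheduling idea.

```python
from collections import deque


def run_commander(tasks, n_workers):
    """Assign pending tasks to idle workers; return (completion order, assignment log).

    tasks is a list of (name, duration_in_ticks) pairs.
    """
    if n_workers < 1:
        raise ValueError("at least one worker is required")
    pending = deque(tasks)
    busy = {}                      # worker id -> [task name, remaining ticks]
    done, log = [], []
    while pending or busy:
        # Hand unprocessed tasks to every worker whose status is idle.
        for worker in range(n_workers):
            if worker not in busy and pending:
                name, ticks = pending.popleft()
                busy[worker] = [name, ticks]      # status becomes busy
                log.append((worker, name))
        # Advance one tick; a finished worker reports and returns to idle.
        for worker in list(busy):
            busy[worker][1] -= 1
            if busy[worker][1] <= 0:
                done.append(busy.pop(worker)[0])
    return done, log


done, log = run_commander([("a", 3), ("b", 1), ("c", 1), ("d", 2)], n_workers=2)
print(log)    # which worker received which task
print(done)   # order in which tasks completed
```

In this run, worker 1 finishes its short tasks early and picks up
the remaining work while worker 0 is still busy with the long
task, which is the load-spreading effect the status signals make
possible.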

As described above, the FIGHTER processing units 140
in the lowermost stage, the FLIGHT processing units 130 and SQUAD
processing units 120 in the intermediate stages, and the GHQ
processing unit 110 in the uppermost stage perform operations
that differ from one another.

That is, because each FIGHTER processing unit 140 performs
the actual arithmetic operation, it does not need
the function to perform complicated decisions and routines, but
it does need a high arithmetic calculation capability.
Accordingly, it is preferable that each FIGHTER processor 143
has plural integer arithmetic units and floating-point
arithmetic units. In this embodiment, for example, the FIGHTER
processing unit 140 includes one integer arithmetic unit and
two floating-point arithmetic units. In this embodiment, hazard
processing circuits and interrupt circuits are omitted in order
to achieve high-speed operation. Accordingly, when the operating
frequency is 533 MHz, it is possible for the parallel computer
having a hierarchy structure of this embodiment to perform the
operation of 1.066 GFLOPS.
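
The 1.066 GFLOPS figure is consistent with two floating-point
arithmetic units each completing one operation per cycle at
533 MHz, which suggests that the figure refers to a single FIGHTER
processing unit. The sketch below reproduces that arithmetic. The
one-operation-per-cycle assumption, the function name, and the
count of FIGHTER processing units (read from the description as
four SQUADs, four FLIGHTs per SQUAD, and sixteen FIGHTERs per
FLIGHT) are assumptions made for illustration.

```python
def peak_gflops(fp_units, clock_mhz):
    """Peak GFLOPS, assuming each floating-point unit completes one operation per cycle."""
    return fp_units * clock_mhz / 1000


per_fighter = peak_gflops(fp_units=2, clock_mhz=533)
fighters = 4 * 4 * 16   # SQUADs x FLIGHTs per SQUAD x FIGHTERs per FLIGHT

print(per_fighter)                         # peak GFLOPS of one FIGHTER
print(fighters)                            # number of FIGHTER units in the system
print(round(per_fighter * fighters, 3))    # aggregate peak GFLOPS under these assumptions
```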

On the other hand, the SQUAD processing
units 120 and the FLIGHT processing units 130 in the intermediate
stages function as brokers; namely, they control the data transfer
between the upper stage (or the uppermost stage) and the lower
stage (or the lowermost stage). Accordingly, it is sufficient that
each of the SQUAD commander processor 123 and the FLIGHT commander
processor 133 incorporates an arithmetic unit of the smallest
operation size. In this embodiment, each of the SQUAD commander
processor 123 and the FLIGHT commander processor 133 incorporates
one integer arithmetic unit.

Because the GHQ processing unit 110 executes a main program, a
general-purpose processor is used as the GHQ commander processor
113. Accordingly, a high-performance microprocessor can be used
as the GHQ commander processor 113.

The configuration of the first embodiment of the present 
invention is realized based on the following technical idea. 

The six components, namely the memory module forming the GHQ main
memory 111, the GHQ processor 113, and the four multi-chip modules
120, are synchronized with a clock of 66 MHz. At this stage, the
frequency of the synchronous clock is kept relatively low because
the six components must be synchronized over a wide area.

Next, each SQUAD processing unit 120 receives the synchronous
clock of 66 MHz from the GHQ processing unit 110, and a Phase
Locked Loop (PLL) (not shown) generates a synchronous clock of
100 MHz, which is 1.5 times the synchronous clock of 66 MHz. This
100 MHz clock is used as the synchronous clock in each SQUAD
processing unit 120. The four FLIGHT processing units 130, the
SQUAD commander processor 123, the SQUAD instruction memory 125,
the SQUAD data memory 127, and the SQUAD DMA memory 121 operate in
synchronization with this 100 MHz clock. The region of one SQUAD
processing unit 120 is integrated into a part of the area of the
GHQ processing unit 110, so that the signal transfer length and
the signal skew are reduced and operation at a high frequency is
possible.

Next, each FLIGHT processing unit 130 receives the synchronous
clock of 100 MHz from the SQUAD processing unit 120, and a PLL
(not shown) or another circuit generates a synchronous clock of
133 MHz, which is approximately 1.5 times the synchronous clock
of 100 MHz. This 133 MHz clock is used as the synchronous clock
in each FLIGHT processing unit 130. The sixteen FIGHTER
processing units 140, the FLIGHT commander processor 133, the
FLIGHT instruction memory 135, the FLIGHT data memory 137, and the
FLIGHT DMA memory 131 operate in synchronization with this
133 MHz clock. The region of one FLIGHT processing unit 130 is
integrated into a part of the area of the SQUAD processing unit
120, so that operation at a higher frequency is possible.

Furthermore, each FIGHTER processing unit 140 receives the
synchronous clock of 133 MHz from the FLIGHT processing unit 130,
and a PLL (not shown) or another circuit generates a synchronous
clock of 266 MHz, which is approximately 2 times the synchronous
clock of 133 MHz. This 266 MHz clock is used as the synchronous
clock in each FIGHTER processing unit 140. Then, a PLL (not shown)
or another circuit generates a synchronous clock of 533 MHz, which
is approximately 2 times the synchronous clock of 266 MHz. This
533 MHz clock is used as an operation clock only for each FLIGHT
commander processor 133. The FLIGHT commander processor 133 and
the FIGHTER memory 141 operate in synchronization with the
synchronous clock of 266 MHz. The region of one FIGHTER processing
unit 140 is integrated into a part of the area of the FLIGHT
processing unit 130, so that both the signal transfer length and
the signal skew are reduced and operation at a high frequency is
possible.
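
The clock distribution described above can be summarized as a
chain of nominal frequencies. The sketch below lists the stated
values and computes the ratio between neighbouring stages; the
table name CLOCK_CHAIN_MHZ and the helper stage_ratios are
illustrative and do not come from the embodiment.

```python
# Nominal synchronous clocks of the first embodiment, in MHz, from the
# uppermost stage down to the 533 MHz operation clock.
CLOCK_CHAIN_MHZ = [
    ("GHQ", 66.0),
    ("SQUAD", 100.0),
    ("FLIGHT", 133.0),
    ("FIGHTER", 266.0),
    ("operation clock", 533.0),
]


def stage_ratios(chain):
    """Return (upper, lower, ratio) for each pair of adjacent stages."""
    return [
        (upper, lower, f_lower / f_upper)
        for (upper, f_upper), (lower, f_lower) in zip(chain, chain[1:])
    ]


for upper, lower, ratio in stage_ratios(CLOCK_CHAIN_MHZ):
    print(f"{upper} -> {lower}: x{ratio:.2f}")
```

Because the nominal values are rounded, the computed ratios only
approximate the multipliers quoted in the text.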

Next, a description will be given of the configuration of the
intermediate stage, namely, the configuration of the SQUAD
processing unit 120 and the FLIGHT processing unit 130, in the
parallel computer according to the present invention.

FIG. 5 is a block diagram showing one example of the configuration
of an intermediate hierarchy unit such as the SQUAD processing
unit 120 or the FLIGHT processing unit 130.

In the configuration of the intermediate stage shown in FIG. 5,
the general-purpose processor serving as the SQUAD commander
processor 123 is connected to a Direct Memory Access (DMA)
controller 151 having 10 channels. Because the DMA controller 151
and the general-purpose processor 123 are in a coprocessor
connection, an available DMA controller can be used.

The DMA controller 151 is connected to a bus to which a memory 121
of a large memory size (serving as the SQUAD DMA memory), a
connection line to the upper stage, and a connection line to the
lower stage are connected. A processor core in the general-purpose
processor 123 has signal lines through which a status signal from
each processor in the lower stage is transferred. For example, one
SQUAD processing unit 120 receives status signals through four
status signal lines connected to the four FLIGHT processing units
130 in the lower stage. Each status signal line is one bit or more
wide. The status signal indicates whether the processor in the
lower stage is in the busy state or the idle state.
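
The busy/idle status lines can be pictured as a small status word
read by the commander processor. The sketch below assumes one bit
per lower-stage unit, with a set bit meaning busy; the bit
assignment, the polarity, and the names Status, decode_status and
idle_units are hypothetical and are not specified in the text.

```python
from enum import Enum


class Status(Enum):
    IDLE = 0
    BUSY = 1


def decode_status(status_word: int, num_units: int = 4) -> list[Status]:
    """Decode one status bit per lower-stage unit (bit i belongs to unit i).

    Assumption: a set bit means busy and a clear bit means idle.
    """
    return [Status((status_word >> i) & 1) for i in range(num_units)]


def idle_units(status_word: int, num_units: int = 4) -> list[int]:
    """Indices of the lower-stage units that currently report idle."""
    return [
        i
        for i, status in enumerate(decode_status(status_word, num_units))
        if status is Status.IDLE
    ]


# Units 0 and 2 busy, units 1 and 3 idle.
print(idle_units(0b0101))
```

A commander processor could poll such a word to decide which
lower-stage unit should receive the next task.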

The SQUAD commander processor 123 is connected to the SQUAD
instruction memory 125 and the SQUAD data memory 127, in which the
programs and data used by the SQUAD commander processor 123 are
stored. These programs expand (or unwind) the data transferred
from the upper stage if necessary, analyze the commands also
transferred from the upper stage, and perform the required
processes. Then, these programs assign tasks, perform scheduling,
and finally transfer the data to be processed to the lower stage.

As one concrete example, first, the data to be processed that are
assigned to the target processing unit are transferred to the DMA
transfer memory. Second, the data are transferred to a processing
unit in the lower stage that is capable of processing the data.
This algorithm may be implemented by the program stored in the
SQUAD data memory 127. In other words, the processing unit in the
intermediate stage functions as an intelligent DMA system within
the entire parallel computer of the present invention.
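
The two-step transfer described above (stage the data in the DMA
memory, then forward it to a lower-stage unit that can accept it)
might be modelled as follows. This is a minimal sketch under
stated assumptions: the class IntermediateUnit, its methods, and
the rule that a unit reports busy as soon as it receives a block
are illustrative choices, not details given in the text.

```python
from collections import deque


class IntermediateUnit:
    """Toy model of an intermediate stage acting as an 'intelligent DMA'."""

    def __init__(self, num_lower_units: int):
        self.dma_memory = deque()              # staging buffer (DMA transfer memory)
        self.busy = [False] * num_lower_units  # status reported by each lower unit
        self.delivered = {i: [] for i in range(num_lower_units)}

    def receive(self, block) -> None:
        """Step 1: data from the upper stage is staged in the DMA memory."""
        self.dma_memory.append(block)

    def forward(self) -> int:
        """Step 2: move staged blocks to lower units that are idle.

        Returns the number of blocks forwarded in this pass.
        """
        sent = 0
        for unit, is_busy in enumerate(self.busy):
            if not self.dma_memory:
                break
            if not is_busy:
                self.delivered[unit].append(self.dma_memory.popleft())
                self.busy[unit] = True         # unit reports busy on receipt
                sent += 1
        return sent

    def complete(self, unit: int) -> None:
        """A lower unit finished its block and returns to the idle state."""
        self.busy[unit] = False


squad = IntermediateUnit(num_lower_units=4)
for block in range(6):
    squad.receive(block)
squad.forward()     # blocks 0..3 go to the four idle units
squad.complete(2)   # unit 2 becomes idle again
squad.forward()     # block 4 goes to unit 2; block 5 stays staged
```

The staging buffer decouples the upper-stage transfer from the
availability of the lower-stage units, which is the role the text
assigns to the DMA memory.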

In the case of a system dedicated to a specialized process that
does not require general-purpose versatility, for example a
graphic simulator or the like, the processors other than the GHQ
commander processor 113 can be implemented by non-von Neumann
type DMA processors, each including a DMA controller implemented
in hardware.

Next, a description will be given of the memory structure used in
the first embodiment.

The easiest implementation of the memory structure is one in which
every processor in the parallel computer has its own local memory
space. Because each processor has a corresponding local memory
space, no snoop-bus protocol and no coherent transaction need to
be provided. In this case, the memory space of the GHQ processor
113 is mapped only onto the GHQ main memory 111. The memory space
of the SQUAD commander processor 123 is mapped onto the SQUAD DMA
memory 121 together with the SQUAD instruction memory 125 and the
SQUAD data memory 127.

The memory space of the GHQ processor 113 and the memory space of
the SQUAD commander processor 123 are independent of each other.
Furthermore, the memory spaces of the different SQUAD commander
processors 123 are independent of one another.

Similarly, the memory space of the FLIGHT commander processor 133
is mapped onto the FLIGHT DMA memory 131 together with the FLIGHT
instruction memory 135 and the FLIGHT data memory 137. The memory
space of the FLIGHT commander processor 133 is independent of the
memory spaces of both the GHQ processor 113 and the SQUAD
commander processor 123. Moreover, the memory spaces of the
different FLIGHT commander processors 133 are independent of one
another.

Similarly, the memory space of each FIGHTER processor 143 is
mapped onto the corresponding FIGHTER memory 141 of 64 Kbytes.
The memory space of the FIGHTER processor 143 is independent of
the memory space of the GHQ processor 113 and the memory space of
each of the SQUAD commander processors 123. Furthermore, the
memory spaces of the different FIGHTER processors 143 are
independent of one another.

It is also possible to divide the memory space of the GHQ
processor 113 into a plurality of memory spaces and to map the
divided memory spaces to all of the processors in the parallel
computer according to the first embodiment. In this configuration,
a move instruction in the GHQ memory 111 is used for the data
transfer between the upper stage and the lower stage.

The move instruction for the memory may be implemented as a DMA
command used between the upper stage and the lower stage. In this
case, one method is for the actual memory of the SQUAD processing
unit 120 and the actual memory of the GHQ processing unit 110 to
share the same addresses. However, since the program executed by
the GHQ processor 113 completely controls the execution state of
all of the processing units, no snoop-bus protocol and no coherent
transaction need to be provided. Similarly, the actual memory of
the FLIGHT processing units 130 and the actual memory of the SQUAD
processing units 120 share the same addresses. In addition, the
actual memory of the FIGHTER processing units 140 and the actual
memory of the FLIGHT processing units 130 share the same
addresses.
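
To illustrate how a divided GHQ memory space and a move
instruction implemented as a DMA command could fit together, the
sketch below maps address windows to processing units and turns a
move into a DMA descriptor. The window size, the layout, the unit
names, and the descriptor fields are all hypothetical; the text
does not specify an address map or a command format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    owner: str  # processing unit whose memory backs this range
    base: int   # first address of the window in the GHQ address space
    size: int   # window length in bytes


# Hypothetical layout: the GHQ keeps the low range for itself and
# exposes one 16 MiB window per SQUAD processing unit.
WINDOW_SIZE = 16 * 1024 * 1024
WINDOWS = [Window("GHQ", 0, WINDOW_SIZE)] + [
    Window(f"SQUAD{i}", (i + 1) * WINDOW_SIZE, WINDOW_SIZE) for i in range(4)
]


def resolve(address: int) -> tuple[str, int]:
    """Map a GHQ address to (owning unit, offset inside that unit)."""
    for window in WINDOWS:
        if window.base <= address < window.base + window.size:
            return window.owner, address - window.base
    raise ValueError(f"address {address:#x} is not mapped")


def move_to_dma_command(src: int, dst: int, length: int) -> dict:
    """Translate a memory move into a DMA command descriptor."""
    src_unit, src_off = resolve(src)
    dst_unit, dst_off = resolve(dst)
    return {
        "from": src_unit, "from_offset": src_off,
        "to": dst_unit, "to_offset": dst_off,
        "length": length,
    }


cmd = move_to_dma_command(src=0x1000, dst=2 * WINDOW_SIZE + 0x40, length=256)
```

Because a single program on the GHQ processor issues every move
in this model, no two units write the same window concurrently,
which is consistent with the text's remark that no coherence
protocol is needed.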

Incidentally, the multiprocessor system of a hierarchy bus
structure according to the first embodiment shown in FIG. 1 has
a configuration in which the four semiconductor chips serving as
the four SQUAD MCMs 120, the GHQ processor 113 (not shown in
FIG. 2), and the main memory 111 (not shown in FIG. 2) are mounted
on the single board 101.

In contrast, the multiprocessor system of a hierarchy bus
structure according to the second embodiment shown in FIG. 4 has
a configuration in which four SQUAD chips 220, a GHQ processor
213, and a main memory 211 are incorporated in a single
semiconductor chip as a multi-chip module (MCM). The configuration
and the operation of the second embodiment will be explained later
in detail.

Second embodiment 
FIG. 3 is a block diagram showing an overview of a multiprocessor
system having a hierarchy bus structure as the parallel computer
having a hierarchy structure according to the second embodiment
of the present invention.

The multiprocessor system of a hierarchy bus structure shown in
FIG. 3 comprises a GHQ main memory 211 of 1 Gbyte formed on a
single semiconductor chip, a GHQ processor 213 formed on a single
semiconductor chip, and four SQUAD processing units 220, each of
which incorporates a plurality of processors (described in detail
later). Each SQUAD processing unit 220 is formed on a single
semiconductor chip.

The GHQ processor 213, the four SQUAD processing units 220, and
the GHQ main memory 211 are connected through a first level bus
212.

The six component units, namely, a memory module forming the GHQ
main memory 211, the GHQ processor 213, and the four SQUAD
processing units 220, are mounted on a single multi-chip module
(MCM). In general, an MCM is formed by a plurality of unpackaged
semiconductor integrated circuits that are incorporated as
subsystems in a package of a normal single semiconductor chip.
One type of MCM comprises a substrate (or a board), a thin film
connector structure of a desired circuit structure, and a
plurality of integrated circuits connected to the thin film
connector structure and surrounded by an epoxy passivation
material. The MCM structure offers users higher frequency
performance than a printed wiring board formed by conventional
plated through-hole and surface mounting technology. That is, as
shown in FIG. 4, packaging the multiple chips on the substrate
reduces both the wiring capacitance and the transfer length. In
general, this configuration increases the performance of the
computer system.

FIG. 4 is a block diagram showing a configuration of the
multi-chip module in which the parallel computer according to
the second embodiment is mounted.

An MCM requires a high-density wiring structure in order to
transfer signals among the IC chips 201a to 201f mounted on a
common substrate formed by the plural layers 202A to 202E. Any
number of layers may be used, to suit the particular fabrication
technology and the wiring density required by the design.

As shown in FIG. 4, the IC chips 201c and 201d correspond to the
GHQ main memory module 211 and the GHQ processor 213,
respectively. The IC chips 201a, 201b, 201e, and 201f each
correspond to one of the SQUAD processing units 220. This
configuration of the second embodiment shown in FIG. 4 is
different from that of the first embodiment shown in FIG. 2.

As shown in FIG. 4, the wiring serving as the first level bus is
formed on each of the plural layers 202A to 202E.

In the configuration of the first embodiment shown in FIGs. 1 and
2, the multilevel ceramic substrate technology described in the
prior art document Japanese patent laid-open publication number
JP-A-10/56036 is used. The same technology can be used for the
second embodiment.

In the case of the first embodiment shown in FIG. 2, each of the
layers 102A to 102E is formed of an insulating ceramic material on
which a patterned metallization layer has been formed.

In the configuration of the second embodiment shown in FIG. 4, a
part of each of the layers 202A to 202D has been removed, so that
a multi-cavity structure is formed. A part of the patterned
metallization layer in each of the layers 202B to 202E is exposed
around the multi-cavity portion.

The exposed part of the layer 202E forms a mounting section for
the chips. The exposed part is coated with a metallization ground
surface, on which the IC chips 201a to 201f are mounted by a chip
mounting technology such as conductive epoxy, solder, or the
like.

Each of the layers 202B to 202D has signal wirings through which
digital data signals are transferred from the IC chips 201a to
201f to the MCM input/output pins or terminals (not shown).

The layer 202A provides chemical, mechanical, and electrical
protection for the lower layers formed beneath it. In addition, a
package cap is mounted on the layer 202A.

Printed wirings, I/O pins, and terminals are formed on the layers
202B to 202D by using available MCM technology, so that the MCM
201 can be connected to external circuits. In wire bonding,
bonding pads at one edge of each of the IC chips 201a to 201f are
connected to selected conductors or bonding pads of the layers
202B to 202D.

The configuration described above provides a wider first-level
bandwidth than a printed wiring board. Similarly, a plurality of
FLIGHT processing units 230 are mounted in each SQUAD processing
unit 220, where they are connected on a single silicon substrate,
which has the advantage of high-speed operation; it is thereby
possible to achieve an even wider bandwidth than with the MCM
structure.

Thus, a feature of the present invention is that the processing
units in a lower stage are more highly integrated and may have a
higher operation frequency.

The GHQ processing unit 210 at the uppermost stage monitors the
entire operation of the parallel computer system. The GHQ
processing unit 210 comprises the one-chip GHQ processor 213 and
the GHQ main memory 211. In the configuration shown in FIG. 4, the
number of stages is four, namely, the GHQ processing unit 210, the
SQUAD processing units 220, the FLIGHT processing units 230, and
the FIGHTER processing units 240.

The GHQ processing unit 210 is directly connected to the four
SQUAD processing units 220, the FLIGHT processing units 230, and
the FIGHTER processing units 240 as lower stages.

The GHQ processing unit 210, the SQUAD processing units 220, and
the GHQ main memory 211 are connected to one another through the
ten RAM buses, so that the total bandwidth is 16 Gbytes/sec
(frequency 400 MHz x 2).

The six component elements, namely the memory module forming the
GHQ main memory 211, the GHQ processor 213, and the four
multi-chip modules 220, are synchronized with a synchronous clock
of 187.5 MHz. Accordingly, each SQUAD processing unit 220 receives
the synchronous clock of 187.5 MHz from the GHQ processing unit
210.

The SQUAD commander processor 223 in each SQUAD processing unit
220 controls the entire operation of the unit 220. The SQUAD
commander processor 223 is connected to the SQUAD instruction
memory 225, the SQUAD data memory 227, and the SQUAD DMA memory
221. The SQUAD processing unit 220 is integrated on a single
semiconductor chip, as shown in FIG. 4.

The SQUAD commander processor 223 is directly connected to the
four FLIGHT processing units 230 in the following stage. Each of
the four FLIGHT processing units 230 controls the entire operation
of sixteen FIGHTER processing units 240.

The SQUAD commander processor 223 is connected to the FLIGHT
processing units 230 through a bus with a width of 6,144 bits.
Accordingly, the total bandwidth is 388 Gbytes/sec (frequency
375 MHz).

The four FLIGHT processing units 230, the SQUAD commander
processor 223, the SQUAD instruction memory 225, the SQUAD data
memory 227, and the SQUAD DMA memory 221 operate in
synchronization with a synchronous clock of 375 MHz. Accordingly,
each FLIGHT processing unit 230 receives the synchronous clock of
375 MHz from the corresponding SQUAD processing unit 220.

The FLIGHT commander processor 233 in each FLIGHT processing unit
230 controls the entire operation of that unit 230. The FLIGHT
commander processor 233 is connected to the FLIGHT instruction
memory 235, the FLIGHT data memory 237, and the FLIGHT DMA memory
231. The FLIGHT processing unit 230 is integrated on the single
semiconductor chip of the SQUAD processing unit 220, as shown in
FIG. 4.

The FLIGHT commander processor 233 is directly connected to the
sixteen FIGHTER processing units 240, each comprising the FIGHTER
processor 243 and the FIGHTER memory of 64 Kbytes. The sixteen
FIGHTER processors 243, the FLIGHT commander processor 233, the
FLIGHT instruction memory 235, the FLIGHT data memory 237, and the
FLIGHT DMA memory 231 are synchronized by a synchronous clock of
750 MHz. Accordingly, each FIGHTER processing unit 240 receives
the synchronous clock of 750 MHz from the corresponding FLIGHT
processing unit 230.

The FLIGHT processing unit 230 and the FIGHTER processor 243 are
connected to each other through a bus with a width of 1028 bits.
Accordingly, the total bandwidth is 99 Gbytes/sec (frequency
750 MHz). The operation frequency of the FIGHTER processor 243 is
1.5 GHz.

The GHQ processing unit 210 divides a program (or a task) into a
plurality of subprograms (or a plurality of subtasks) and sends
the divided subprograms to the respective SQUAD processing units
220. After dividing the program or the task, the GHQ processing
unit 210 compresses the subprograms (or subtasks) and then
transfers the compressed subprograms to the SQUAD processing
units 220. The compression algorithm may be, for example, a
run-length method or a Huffman coding method. The compression
method is selected according to the characteristics of the data
to be compressed. If data compression is not necessary, the
subtasks are transferred to the SQUAD processing units 220
without compression.
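
The optional compression step can be sketched with a simple
run-length coder. The example below is illustrative only: the
(count, value) byte-pair format, the rule "compress only if the
result is smaller", and the function names are assumptions, and
the Huffman alternative mentioned in the text is not shown.

```python
def rle_encode(data: bytes) -> bytes:
    """Run-length encode as (count, value) byte pairs, count in 1..255."""
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and data[i + run] == data[i] and run < 255:
            run += 1
        out += bytes((run, data[i]))
        i += run
    return bytes(out)


def rle_decode(encoded: bytes) -> bytes:
    """Inverse of rle_encode."""
    out = bytearray()
    for count, value in zip(encoded[0::2], encoded[1::2]):
        out += bytes([value]) * count
    return bytes(out)


def pack_subtask(data: bytes) -> tuple[bool, bytes]:
    """Compress only when it actually shrinks the payload.

    Returns (compressed?, payload) so the receiver knows whether to decode.
    """
    encoded = rle_encode(data)
    if len(encoded) < len(data):
        return True, encoded
    return False, data


def unpack_subtask(compressed: bool, payload: bytes) -> bytes:
    """Receiver side: decode only if the sender compressed the payload."""
    return rle_decode(payload) if compressed else payload
```

Data with long runs, such as flat image regions, shrinks under
this scheme, while data without runs is sent unchanged; this
mirrors the text's point that the method depends on the
characteristics of the data.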

In the parallel computer having a hierarchy structure according
to the present invention, the task is divided into a plurality of
subtasks, the divided subtasks are compressed if necessary, and
the subtasks are then transferred to the following stage.
Therefore, the size of a subtask becomes smaller at the processing
units in lower stages, and the increase in the required bandwidth
can be suppressed even when the operation frequency becomes high.
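
How quickly the per-unit data size falls with depth can be
estimated from the fan-out of each stage (four SQUAD units, four
FLIGHT units per SQUAD, sixteen FIGHTER units per FLIGHT). The
sketch below assumes the task splits evenly and ignores
compression; the 64 MiB task size and the helper name
per_unit_share are hypothetical.

```python
# Fan-out of each stage as described for the embodiments:
# 1 GHQ -> 4 SQUAD -> 4 FLIGHT per SQUAD -> 16 FIGHTER per FLIGHT.
FAN_OUT = [("SQUAD", 4), ("FLIGHT", 4), ("FIGHTER", 16)]


def per_unit_share(task_bytes: int, fan_out=FAN_OUT):
    """Return (stage, number of units, bytes per unit) for each stage.

    Assumption: the task splits evenly and is not compressed.
    """
    rows = []
    units = 1
    for stage, children in fan_out:
        units *= children
        rows.append((stage, units, task_bytes / units))
    return rows


for stage, units, share in per_unit_share(64 * 1024 * 1024):
    print(f"{stage:8s} units={units:4d} bytes/unit={share:,.0f}")
```

With this fan-out there are 256 FIGHTER units in total, so each
one handles 1/256 of the original task under the even-split
assumption.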

When the SQUAD commander processor 223 in the SQUAD processing
unit 220 receives the task data (or compressed task data, if
necessary) from the GHQ processor 213 in the GHQ processing unit
210, it notifies the GHQ processing unit 210 that the SQUAD
processing unit 220 has entered the busy state. Then, if the
received task has been compressed, the SQUAD commander processor
223 decompresses the received task data.

The SQUAD commander processor 223 in the SQUAD processing unit
220 then further divides the received task data in order to
assign the divided tasks to the FLIGHT processing units 230.
After dividing the task, the SQUAD processing unit 220 compresses
the divided tasks and then transfers the compressed tasks to the
FLIGHT processing units 230. If it is inappropriate or unnecessary
to divide the task, the undivided task is transferred to the
FLIGHT processing units 230. When the FLIGHT processing unit 230
receives the task (or compressed task data, if necessary) from the
SQUAD processing unit 220, it sends to the SQUAD processing unit
220 a request to set the status of the FLIGHT processing unit 230
to the busy state.

Then, if the received task has been compressed, the FLIGHT
processing unit 230 decompresses the received task data.

The FLIGHT processing units 230 further divide the received task
into a plurality of tasks and then transfer the divided task data
to the FIGHTER processing units 240. Here, the task data means the
content of the processing and the necessary data. That is, the
main function of both the SQUAD processing unit 220 and the FLIGHT
processing unit 230, as intermediate nodes, is scheduling and data
transfer. The FIGHTER processing units 240 at the lowermost stage
perform the actual processing of the task. When a FIGHTER
processing unit 240 receives the task data, it sends to the FLIGHT
processing unit 230 at the upper stage a request to set the state
of that FIGHTER processing unit 240 to the busy state, and then
processes the received task data. After completing the processing
of the task data, the FIGHTER processing unit 240 transfers the
operation result to the FLIGHT commander processor 233 in the
FLIGHT processing unit 230, and the status of the FIGHTER
processing unit 240 is then set to the idle state.

When the FLIGHT processing unit 230 detects a FIGHTER processing
unit 240 in the idle state, it assigns task data that has not yet
been processed to that idle FIGHTER processing unit 240.

When all of the task data items divided by one FLIGHT processing
unit 230 have been processed by the FIGHTER processing units 240,
the FLIGHT processing unit 230 transfers the operation result to
the SQUAD processing unit 220, and this SQUAD processing unit 220
then changes the status of the FLIGHT processing unit 230 from the
busy state to the idle state.

In the same manner as the FLIGHT processing unit 230, when the
SQUAD processing unit 220 detects a FLIGHT processing unit 230 in
the idle state, it assigns an unprocessed task to that FLIGHT
processing unit 230.

Similarly, when the SQUAD processing unit 220 has received the
operation results from all of the FLIGHT processing units 230 at
the lower stage, it sends the operation result to the GHQ
processing unit 210 in the uppermost stage. The GHQ processing
unit 210 then sets the SQUAD processing unit 220 to the idle
state.

That is, in the same manner as the FLIGHT processing unit 230,
when the GHQ processing unit 210 detects a SQUAD processing unit
220 in the idle state and an unprocessed task remains, it assigns
the unprocessed task to that SQUAD processing unit 220. When the
SQUAD processing units 220 have completed all of the tasks from
the GHQ processing unit 210, the execution of the given program
is complete.
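
The divide, dispatch, and gather flow described above can be
condensed into a short recursive sketch. It runs sequentially and
does not model parallel execution, compression, or the busy/idle
handshake; the names split and run_stage and the even chunking
rule are illustrative assumptions rather than the scheduling
policy of the embodiment.

```python
def split(items, parts):
    """Divide a list into at most `parts` contiguous, non-empty chunks."""
    if not items:
        return []
    size = -(-len(items) // parts)  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def run_stage(items, fan_out, work):
    """Process `items` through the remaining stages of the hierarchy.

    fan_out lists the number of child units at each lower stage; when it
    is empty the caller is a lowermost (FIGHTER-like) unit that does the
    work. Each parent gathers its children's results before reporting
    upward.
    """
    if not fan_out:
        return [work(x) for x in items]
    results = []
    for chunk in split(items, fan_out[0]):
        results.extend(run_stage(chunk, fan_out[1:], work))
    return results


# GHQ -> 4 SQUAD -> 4 FLIGHT -> 16 FIGHTER, squaring 1000 numbers.
data = list(range(1000))
out = run_stage(data, [4, 4, 16], lambda x: x * x)
```

Only the lowermost call performs arithmetic; every other level
merely divides its input and collects results, which matches the
broker role the text gives to the intermediate stages.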

As described above, the FIGHTER processing units 240 in the
lowermost stage, the SQUAD processing units 220 and the FLIGHT
processing units 230 in the intermediate stages, and the GHQ
processing unit 210 in the uppermost stage perform operations
that differ from one another.

That is, because each FIGHTER processing unit 240 performs the
actual arithmetic operations, it does not need functions for
complicated decisions and routines, but it does need high
arithmetic performance. Accordingly, it is preferable that each
FIGHTER processor 243 has plural integer arithmetic units and
floating-point arithmetic units. In this embodiment, for example,
the FIGHTER processing unit 240 includes one integer arithmetic
unit and two floating-point arithmetic units, and hazard
processing circuits and interrupt circuits are omitted in order to
increase the operating speed. Accordingly, when the operation
frequency is 1.5 GHz, the parallel computer having a hierarchy
structure of this embodiment can perform operations at 24 GFLOPS.

On the other hand, the SQUAD processing units 220 and the FLIGHT
processing units 230 in the intermediate stages function as
brokers; namely, they control the data transfer between the upper
stage (or the uppermost stage) and the lower stage (or the
lowermost stage). Accordingly, it is sufficient that each of the
SQUAD commander processor 223 and the FLIGHT commander processor
233 incorporates an arithmetic unit of the smallest operation
size. In this embodiment, each of the SQUAD commander processor
223 and the FLIGHT commander processor 233 incorporates one
integer arithmetic unit.

Because the GHQ processing unit 210 executes a main program, a
general-purpose processor is used as the GHQ commander processor
213. It is therefore possible to use a high-performance
microprocessor as the GHQ commander processor 213.

The configuration of the second embodiment of the present
invention is realized based on the following technical idea.

The six components, namely the memory module forming the GHQ main
memory 211, the GHQ processor 213, and the four multi-chip modules
220, are synchronized with the synchronous clock of 187.5 MHz. At
this stage, the frequency of the synchronous clock is kept
relatively low because the six components, which are placed over
a wide area, must be synchronized. Note that the GHQ main memory
211 operates on a 400 MHz clock that is used for asynchronous data
transfer, not for synchronous data transfer.

Next, each SQUAD processing unit 220 receives the synchronous
clock of 187.5 MHz from the GHQ processing unit 210, and a Phase
Locked Loop (PLL) (not shown) generates a synchronous clock of
375 MHz, which is 2 times the synchronous clock of 187.5 MHz.
This 375 MHz clock is used as the synchronous clock in each SQUAD
processing unit 220. The four FLIGHT processing units 230, the
SQUAD commander processor 223, the SQUAD instruction memory 225,
the SQUAD data memory 227, and the SQUAD DMA memory 221 operate in
synchronization with this 375 MHz clock.

The region of one SQUAD processing unit 220 is integrated into a
part of the area of the GHQ processing unit 210, so that both the
signal transfer length and the signal skew are reduced and
operation at a high frequency is possible.

Next, each FLIGHT processing unit 230 receives the synchronous
clock of 375 MHz from the SQUAD processing unit 220, and a PLL
(not shown) or another circuit generates a synchronous clock of
750 MHz, which is approximately 2 times the synchronous clock of
375 MHz. This 750 MHz clock is used as the synchronous clock in
each FLIGHT processing unit 230. The sixteen FIGHTER processing
units 240, the FLIGHT commander processor 233, the FLIGHT
instruction memory 235, the FLIGHT data memory 237, and the FLIGHT
DMA memory 231 operate in synchronization with this 750 MHz
clock.

The region of one FLIGHT processing unit 230 is integrated into a
part of the area of the SQUAD processing unit 220, so that
operation at a higher frequency is possible.

Furthermore, each FIGHTER processing unit 240 receives the
synchronous clock of 750 MHz from the FLIGHT processing unit 230,
and a PLL (not shown) or another circuit generates a synchronous
clock of 1.5 GHz, which is approximately 2 times the synchronous
clock of 750 MHz. This 1.5 GHz clock is used as the synchronous
clock in each FIGHTER processing unit 240. Each FLIGHT commander
processor 233 and the FIGHTER memory 241 operate in
synchronization with this synchronous clock of 1.5 GHz. The
FIGHTER processing unit 240 is integrated into a small region, so
that the signal transfer length and the signal skew are kept as
small as possible, and operation at a high frequency is thereby
possible.

Although the internal processes in the FLIGHT processing unit 230
operate on the synchronous clock of 750 MHz, it is difficult for
the whole of the GHQ processing unit 210 to operate with a clock
of 750 MHz. Accordingly, the different FLIGHT processing units
230 cannot operate in synchronization with one another. However,
this causes no problem as long as the SQUAD processing units in
the upper stage operate synchronously.

Next, a description will be given of the configuration of the
intermediate stage, namely, the configuration of the SQUAD
processing unit 220 and the FLIGHT processing unit 230, in the
parallel computer of the present invention.






FIG. 5, which was used in the explanation of the first embodiment,
shows the configuration of one unit in the intermediate stage.

In the configuration of the intermediate stage shown in FIG. 5,
the general-purpose processor serving as the SQUAD commander
processor 223 is connected to a Direct Memory Access (DMA)
controller 151 having 10 channels. Because the DMA controller 151
and the general-purpose processor 223 are in a coprocessor
connection, an available DMA controller can be used.

The DMA controller 151 is connected to a bus, to which
a memory 221 of a large memory size (serving as the SQUAD DMA
memory), a connection line to the upper stage, and a connection
line to the lower stage are connected. A processor core in the
processor 223 has signal lines through which a status signal from
each processor in the lower stage is transferred. For example, one
SQUAD processing unit 220 receives status signals through four
status signal lines connected to the four FLIGHT processing units
230 in the lower stage.

Each status signal line is one bit or more wide. The status
signal indicates whether the processor in the lower stage is
in the busy state or the idle state.

The SQUAD commander processor 223 is connected to the SQUAD
instruction memory 225 and the SQUAD data memory 227, in which
programs and data to be used by the SQUAD commander processor
223 are stored. These programs expand (or decompress) data
transferred from the upper stage if necessary, analyze commands
also transferred from the upper stage, and perform the required
processes. Then, these programs assign tasks and perform
scheduling, and finally transfer the data to be processed to
the lower stage.

As one concrete example, first, the data to be processed
that are assigned to the target processing unit are transferred
to the DMA transfer memory. Second, the data are transferred
to the processing unit in the lower stage that is capable of
processing the data. This algorithm may be implemented by the
program that has been stored in the SQUAD data memory 227.
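
The two-step transfer described above can be sketched in Python as
follows. This is only an illustrative model under stated
assumptions, not the program stored in the SQUAD data memory 227:
the names LowerUnit, IntermediateUnit, stage and dispatch are
hypothetical, the status signal is modeled as a single busy/idle
flag, and the DMA transfer memory is modeled as a simple queue.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class LowerUnit:
    """Hypothetical model of one processing unit in the lower stage."""
    name: str
    busy: bool = False                      # models the busy/idle status signal
    received: list = field(default_factory=list)


@dataclass
class IntermediateUnit:
    """Hypothetical model of an intermediate unit acting as an intelligent DMA."""
    lower_units: list
    dma_memory: deque = field(default_factory=deque)  # models the DMA transfer memory

    def stage(self, block):
        # Step 1: data assigned to this unit are placed in the DMA transfer memory.
        self.dma_memory.append(block)

    def dispatch(self):
        # Step 2: forward staged blocks to lower units whose status is idle.
        sent = 0
        for unit in self.lower_units:
            if not self.dma_memory:
                break
            if not unit.busy:
                unit.received.append(self.dma_memory.popleft())
                unit.busy = True            # the unit reports busy until it finishes
                sent += 1
        return sent


units = [LowerUnit("FLIGHT1"), LowerUnit("FLIGHT2", busy=True), LowerUnit("FLIGHT3")]
squad = IntermediateUnit(units)
for block in ("A", "B", "C"):
    squad.stage(block)
sent = squad.dispatch()
```

In this sketch the busy unit is skipped, so one block stays in the
queue until a later call to dispatch finds an idle unit.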

In other words, the processing unit in the intermediate
stage functions as an intelligent DMA system within the
entire parallel computer of the present invention.

In the case of a system dedicated to a specialized process
that does not require general-purpose versatility, for example a
graphic simulator or the like, it is possible to implement the
processors other than the GHQ commander processor 113 by using
non-von Neumann type DMA processors including a DMA controller
that is implemented in hardware.

Next, a description will be given of the memory structure 
used in the first embodiment. 

The easiest implementation of the memory structure is one in
which every processor in the parallel computer has its own local
memory space. Because each processor has a corresponding local
memory space, it is not necessary to prepare any snoop-bus
protocol or any coherent transaction. In this case, the memory
space of the GHQ processor 213 is mapped only in the GHQ main
memory 211. The memory space of the SQUAD commander processor
223 is mapped in the SQUAD DMA memory 221 together with the SQUAD
instruction memory 225 and the SQUAD data memory 227.

The memory space of the GHQ processor 213 and the memory
space of the SQUAD commander processor 223 are independent
of each other. Furthermore, the memory spaces of the different
SQUAD commander processors 223 are independent of each other.

Similarly, the memory space of the FLIGHT commander
processor 233 is mapped in the FLIGHT DMA memory 231 together with
the FLIGHT instruction memory 235 and the FLIGHT data memory 237.
The memory space of the FLIGHT commander processor 233 is
independent of the memory spaces of both the GHQ processor
213 and the SQUAD commander processor 223. Moreover, the memory
spaces of the FLIGHT commander processors 233 are independent of
each other.

Similarly, the memory space of each FIGHTER processor 243
is mapped in the corresponding FIGHTER memory 241 of 64Kbytes.
The memory space of the FIGHTER processor 243 is independent
of the memory space of the GHQ processor 213 and the memory space
of each SQUAD commander processor 223. Furthermore, the memory
spaces of the FIGHTER processors 243 are independent of each other.

It is also possible to divide the memory space of the GHQ
processor 213 into a plurality of memory spaces and to map
the divided memory spaces to all of the processors in the parallel
computer according to the second embodiment. In this
configuration, a move instruction in the GHQ memory 211 is used
for the data transfer between the upper stage and the lower stage.

The move instruction for the memory may be implemented
as a DMA command to be used between the upper stage and the lower
stage. In this case, there is a method in which the actual memory
of the SQUAD processing unit 220 and the actual memory of the GHQ
processing unit 210 share the same address. However, since
the program executed by the GHQ processor 213 completely controls
the execution state of all of the processing units, it is not
necessary to prepare any snoop-bus protocol or any coherent
transaction. Similarly, the actual memory of the FLIGHT
processing units 230 and the actual memory of the SQUAD processing
units 220 share the same address. In addition, the actual memory
of the FIGHTER processing units 240 and the actual memory of the
FLIGHT processing units 230 share the same address.
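
As a rough illustration of how a memory move in a divided address
space could be turned into a DMA command, the following Python
sketch may help. Everything in it is an assumption made for
illustration: the window table WINDOWS, its base addresses and
sizes, the DmaCommand record, and the functions locate and move
are hypothetical and do not come from the embodiments.

```python
from dataclasses import dataclass

# Hypothetical address windows: (base, size, unit name). The bases and
# sizes are illustrative assumptions, not values from the description.
WINDOWS = [
    (0x0000_0000, 0x1000_0000, "GHQ"),
    (0x1000_0000, 0x0400_0000, "SQUAD1"),
    (0x1400_0000, 0x0400_0000, "SQUAD2"),
]


@dataclass(frozen=True)
class DmaCommand:
    src_unit: str
    src_offset: int
    dst_unit: str
    dst_offset: int
    length: int


def locate(address):
    """Return (unit, local offset) for an address in the divided memory space."""
    for base, size, unit in WINDOWS:
        if base <= address < base + size:
            return unit, address - base
    raise ValueError(f"address {address:#x} is not mapped")


def move(src, dst, length):
    """Translate a memory move into a DMA command between two units."""
    src_unit, src_off = locate(src)
    dst_unit, dst_off = locate(dst)
    return DmaCommand(src_unit, src_off, dst_unit, dst_off, length)


cmd = move(0x0000_2000, 0x1400_0100, 4096)
```

Because a single program issues every move in this model, no
coherence check appears in the sketch, which mirrors the statement
that no snoop-bus protocol is needed.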

Although the preferred first and second
embodiments described above have shown the multiprocessor system
of a hierarchy bus structure, as shown in FIGs. 1 to 4, the present
invention is not limited to these configurations.

FIG. 10A and FIG. 10B are diagrams each showing a connection
structure of processing units in the parallel computer having
a hierarchy structure according to the present invention. As
shown in FIG. 10A and FIG. 10B, the concept of the present
invention can be applied to various connection configurations.
For example, the connection among the FLIGHT processing unit 130
and the corresponding FIGHTER processing units 140 in the parallel
computer of the present invention can be formed based on a
cross-bus connection (shown in FIG. 10A), a star connection
(shown in FIG. 10B), or other connections.

<Difference in performance between the present invention and
prior art>

Next, a description will be given of
the difference in performance between the multiprocessor system
having a hierarchy bus structure of the present invention and
that of the prior art.

First, in the multiprocessor system of a hierarchy bus
structure of the second embodiment shown in FIG. 3 in which each
stage has only a cache, the estimation of the data transfer amount
in a collision decision application is as follows:

In performing the collision decision between objects shown
in an image, each object is divided into regions, each of which
is called a bounding shape. The collision decision is performed
for all combinations of the bounding shapes. When the bounding
shape is a sphere, the collision decision between one bounding
shape and another bounding shape can be expressed by the
following calculation:

$$(x_1 - x_2)^2 + (y_1 - y_2)^2 + (z_1 - z_2)^2 < (r_1 + r_2)^2$$

The content of the calculation amount is as follows:
(1) Load of eight elements x1, y1, z1, x2, y2, z2, r1,
and r2 × 4 bytes = 32 bytes;
(2) Six additions/subtractions;
(3) Four multiplications; and
(4) One comparison.

Accordingly, each collision decision requires 8 loads
and 11 floating-point operations (11FP).
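
The operation count above can be checked against a minimal Python
sketch of the sphere test. The function name spheres_collide and
the tuple layout (x, y, z, r) are hypothetical, and the sketch
assumes that the right-hand side of the inequality is the square
of the sum of the two radii.

```python
def spheres_collide(a, b):
    """Return True if two bounding spheres given as (x, y, z, r) overlap."""
    x1, y1, z1, r1 = a                       # four loads of 4 bytes each
    x2, y2, z2, r2 = b                       # four more loads, 32 bytes in total
    dx = x1 - x2                             # subtraction 1
    dy = y1 - y2                             # subtraction 2
    dz = z1 - z2                             # subtraction 3
    rs = r1 + r2                             # addition 4 (assumed radius sum)
    dist_sq = dx * dx + dy * dy + dz * dz    # 3 multiplications, additions 5 and 6
    return dist_sq < rs * rs                 # multiplication 4 and 1 comparison


ADD_SUB, MUL, CMP = 6, 4, 1
FP_OPS_PER_TEST = ADD_SUB + MUL + CMP        # 11 floating-point operations
BYTES_PER_TEST = 8 * 4                       # eight 4-byte elements
```

For example, two unit spheres whose centers are 1.5 apart overlap,
while the same spheres with centers 3 apart do not.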

In the system including the FIGHTER processor at the
lowermost stage having the calculation ability of
    2FP × 1.5GHz = 3GFLOPS,
each FIGHTER processor has a collision decision ability of
    3GFLOPS / 11FP ≈ 275M times/sec.
This FIGHTER processor consumes data of
    3GFLOPS / 11FP × 32bytes ≈ 8.75Gbytes/sec.
When the system has 128 processors × 2FP × 1.5GHz = 384GFLOPS,
the collision decision ability becomes
    384GFLOPS / 11FP ≈ 34.9G times/sec.
In 1/60 second, it becomes
    384GFLOPS / 11FP / 60 ≈ 580M times/frame.
This equals sqrt(2 × 580M) ≈ 34,134 and means the ability
to perform the collision decision among more than 30,000
bounding shapes per 1/60 sec. The bandwidth necessary for
this ability is:
    FLIGHT bus: 8.75Gbytes/sec × 8 = 70Gbytes/sec; and
    SQUAD bus: 70Gbytes/sec × 4 = 280Gbytes/sec.
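
The chain of estimates above can be reproduced with a short Python
calculation. This is a sketch only: the variable names are
hypothetical, decimal prefixes are assumed, and the exact results
differ slightly from the rounded figures quoted in the text (for
example, about 272.7M rather than 275M decisions per second).

```python
import math

FP_PER_CYCLE = 2        # floating-point operations per clock, per FIGHTER
CLOCK_HZ = 1.5e9        # FIGHTER clock
FP_PER_TEST = 11        # operations per collision decision
BYTES_PER_TEST = 32     # data loaded per collision decision
NUM_FIGHTERS = 128
FRAME_RATE = 60         # frames per second

fighter_flops = FP_PER_CYCLE * CLOCK_HZ               # 3 GFLOPS
tests_per_sec = fighter_flops / FP_PER_TEST           # about 272.7M per second
bytes_per_sec = tests_per_sec * BYTES_PER_TEST        # about 8.73 Gbytes/sec

system_flops = NUM_FIGHTERS * fighter_flops           # 384 GFLOPS
system_tests_per_sec = system_flops / FP_PER_TEST     # about 34.9G per second
tests_per_frame = system_tests_per_sec / FRAME_RATE   # about 582M per frame

# n shapes need roughly n * n / 2 pairwise decisions, so n ~ sqrt(2 * tests).
shapes_per_frame = math.sqrt(2 * tests_per_frame)

flight_bus = bytes_per_sec * 8    # eight FIGHTER processors per FLIGHT bus
squad_bus = flight_bus * 4        # four FLIGHT buses per SQUAD bus
```

The bus figures come out at roughly 70 and 280 Gbytes/sec, in line
with the estimate for the cache-only configuration.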
Next, a description will be given of the case of the data
expansion and the uniform load dispersion by using the processor
in the intermediate node.

As shown in FIG. 6, for example, the GHQ processing unit
in the uppermost stage divides tasks of the source side and the
target side into subgroups (for example, 10 subgroups), and then
the processors process the divided subgroups per m × m. The
subgroups are dispersed by the DMA to the processors in the idle
state. That is, the tasks are processed with load dispersion
while checking which processor is in the idle state. Thereby,
even if one processor detects a collision so that its processing
time becomes long, the tasks can be dispersed over all of the
processors. For example, when it is necessary to perform the
collision decisions for 100,000 bounding shapes, 1/4 of the data
for the total collision decisions is dispersed to each of the
four SQUAD commander processors.

As one example, as shown in FIG. 7, the SQUAD 1 as the SQUAD
commander processor handles 1 to n/2 bounding shapes, the SQUAD
2 as the SQUAD commander processor handles 1 to n/4 and (n/2)+1
to n bounding shapes, the SQUAD 3 as the SQUAD commander processor
handles the (n/4)+1 to n/2 and (n/2)+1 to n bounding shapes,
and the SQUAD 4 as the SQUAD commander processor handles (n/2)+1
to n bounding shapes. The GHQ processor 213 in the GHQ processing
unit 110 performs this load dispersion and the DMA transfer.

Of course, as shown in FIG. 8, it is equivalent that the
SQUAD 1 handles 1 to n/2 bounding shapes, the SQUAD 2 handles
(n/2)+1 to 3n/4 and 1 to n/2 bounding shapes, the SQUAD 3 handles
the (3n/4)+1 to n and 1 to n/2 bounding shapes, and the SQUAD
4 handles (n/2)+1 to n bounding shapes.
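
The division of FIG. 7 can be checked with a small Python sketch.
It rests on an interpretation that is an assumption here: a SQUAD
that "handles a to b" tests every pair inside that range, and a
SQUAD that handles two ranges tests every pair formed by one shape
from each range. The function name squad_pairs is hypothetical.

```python
from itertools import combinations


def squad_pairs(n):
    """Pairs tested by SQUAD 1 to 4 for shapes 1..n (n divisible by 4)."""
    first_half = range(1, n // 2 + 1)
    second_half = range(n // 2 + 1, n + 1)
    first_quarter = range(1, n // 4 + 1)
    second_quarter = range(n // 4 + 1, n // 2 + 1)
    return [
        set(combinations(first_half, 2)),                        # SQUAD 1
        {(i, j) for i in first_quarter for j in second_half},    # SQUAD 2
        {(i, j) for i in second_quarter for j in second_half},   # SQUAD 3
        set(combinations(second_half, 2)),                       # SQUAD 4
    ]


n = 16
parts = squad_pairs(n)
all_pairs = set(combinations(range(1, n + 1), 2))
covered = set().union(*parts)
sizes = [len(p) for p in parts]
```

Under this interpretation the four sets cover every pair exactly
once and are nearly equal in size, which is the uniform load
dispersion the description aims at.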

Next, each SQUAD processing unit in the intermediate stage
disperses the tasks in the same manner as described above.

The SQUAD commander processor handles the load dispersion
and the DMA transfer. For example, as shown in the configurations
of FIG. 1 and FIG. 3, the SQUAD 2 (as the SQUAD processing unit)
disperses the load into the FLIGHT 1 to FLIGHT 4 (as the FLIGHT
processing units). In this case, there is no difference in the
dispersion efficiency when compared with that of the GHQ
processing unit. Because there are sixteen FLIGHT processing
units in the system, 1/16 of the total collision decisions to be
processed is assigned to each FLIGHT processing unit when the
total number of the collision decisions is equally divided by 16.
The maximum data amount to be stored in each FLIGHT processing
unit is approximately (1/4 + 1/8 =) 3/8 of the total amount of
data.

Each FLIGHT commander processor further divides the
received data into small-sized regions based on the sub-group
method described above. The amount of the division is determined
based on the received data amount.

The group of the divided collision decisions is assigned
to each FLIGHT commander processor in order to execute the
collision decision operation. Each FLIGHT commander processor
executes a flat decision for the collision.

The amount of this data transfer will be estimated in an
optimized case. When the GHQ bus transfers 1.6Mbytes of data to
the four SQUAD processing units and updates the data every 1/60
second, the speed of the data transfer becomes:
    1.6Mbytes × 4 (SQUAD processing units) ÷ 1/60 seconds
    = 384Mbytes/sec.

Because the SQUAD bus transfers approximately 580Kbytes
(578904 bytes) to the four FLIGHT processing units, the required
data bus bandwidth becomes:
    580Kbytes × 4 (FLIGHT processing units) ÷ 1/60 seconds
    = 139.2Mbytes/sec.

On the other hand, the data bandwidth required for the
FLIGHT bus becomes:
    1110 / (1/60 seconds) × 16Kbytes = approximately 1Gbytes/sec.
This value is approximately 1/140 of the 140Gbytes/sec of the
prior art. FIG. 9 is a diagram showing a comparison in operation
between multiprocessor systems of a hierarchy bus structure of
both the prior art and the present invention. That is, FIG. 9 shows
the estimation results described above.
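
The three bandwidth estimates above reduce to simple products,
reproduced here as a Python sketch. The variable names are
hypothetical, decimal prefixes are assumed, and the factors 1110
and 16Kbytes are taken from the FLIGHT bus formula as given,
without further interpretation.

```python
FRAME_RATE = 60                  # data are updated every 1/60 second

ghq_block_bytes = 1.6e6          # sent to each SQUAD processing unit
squad_block_bytes = 580e3        # sent to each FLIGHT processing unit
flight_transfers = 1110          # factor from the FLIGHT bus formula
flight_block_bytes = 16e3        # 16 Kbytes per transfer

ghq_bus = ghq_block_bytes * 4 * FRAME_RATE                    # bytes/sec
squad_bus = squad_block_bytes * 4 * FRAME_RATE                # bytes/sec
flight_bus = flight_transfers * FRAME_RATE * flight_block_bytes

print(f"GHQ bus:    {ghq_bus / 1e6:.1f} Mbytes/sec")
print(f"SQUAD bus:  {squad_bus / 1e6:.1f} Mbytes/sec")
print(f"FLIGHT bus: {flight_bus / 1e9:.2f} Gbytes/sec")
```

The FLIGHT bus figure evaluates to slightly more than 1 Gbytes/sec,
consistent with the approximate value stated above.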

Accordingly, the hierarchy of the units in the
multiprocessor system of a hierarchy bus structure of the present
invention can suppress clock skew and operate a desired number
of high-speed processors built with leading-edge technology in
parallel.

While the above provides a full and complete disclosure
of the preferred embodiments 1 and 2 for the parallel computer
having a hierarchy structure according to the present invention,
various modifications, alternate constructions and equivalents
may be employed without departing from the scope of the invention
by a person with ordinary skill in the art to which the present
invention pertains. Therefore the above description and
illustration should not be construed as limiting the scope of
the invention, which is defined by the appended claims.

WHAT IS CLAIMED IS:

1. A parallel computer having a hierarchy structure
comprising:

an upper processing unit for executing a parallel
processing task in parallel; and

a plurality of lower processing units connected to the
upper processing unit through a connection line,

wherein the upper processing unit divides the parallel
processing task to a plurality of subtasks, and assigns the
plurality of subtasks to the corresponding lower processing units
and transfers data to be required for executing the plurality
of subtasks to the lower processing units, and

the lower processing units execute the corresponding
subtasks from the upper processing unit, and inform the
completion of the execution of the corresponding subtasks to
the upper processing unit when the execution of the subtasks
is completed, and

the upper processing unit completes the parallel
processing task when receiving the information of the completion
of the execution from all of the lower processing units.

2. A parallel computer having a hierarchy structure
comprising:

an upper processing unit for executing a parallel
processing task in parallel;

a plurality of intermediate processing units connected
to the upper processing unit through a first connection line;
and

a plurality of lower processing units connected to the
intermediate processing units through a second connection line,

wherein the upper processing unit divides the parallel
processing task to a plurality of first subtasks, and assigns
the plurality of first subtasks to the corresponding intermediate
processing units, and transfers data to be required for executing
the plurality of first subtasks to the intermediate processing
units, and

the intermediate processing units divide the first
subtasks to a plurality of second subtasks, and assign the
plurality of second subtasks to the corresponding lower
processing units, and transfer data to be required for executing
the plurality of second subtasks to the lower processing units,
and

the lower processing units execute the corresponding
second subtasks, and inform the completion of the execution of
the second subtasks to the corresponding intermediate processing
units when the execution of all of the second subtasks is completed,
and

the intermediate processing units inform the completion
of the execution of the corresponding second subtasks to the
upper processing unit when the execution of all of the first
subtasks is completed, and

the upper processing unit completes the parallel
processing task when receiving the information of the completion
of the execution from all of the intermediate processing units.

3. A parallel computer having a hierarchy structure according
to claim 1, wherein the lower processing units connected to the
connection line are mounted on a smaller area when compared with
the upper processing unit, and a signal line through which each
lower processing unit is connected has a smaller wiring capacity,
and an operation frequency for the lower processing units is
higher than that for the upper processing unit.

4. A parallel computer having a hierarchy structure according
to claim 2, wherein the lower processing units connected to the
second connection line are mounted on a smaller area when compared
with the intermediate processing units connected to the first
connection line, and a signal line through which each lower
processing unit is connected has a smaller wiring capacity, and
an operation frequency for the lower processing units is higher
than that for the intermediate processing units.

5. A parallel computer having a hierarchy structure according
to claim 3, wherein each of the upper processing unit and the
lower processing units has a processor and a memory connected
to the processor.

6. A parallel computer having a hierarchy structure according
to claim 4, wherein each of the upper processing unit, the
intermediate processing units, and the lower processing units
has a processor and a memory connected to the processor.

7. A parallel computer having a hierarchy structure according
to claim 3, wherein the upper processing unit receives
information regarding the completion of the subtask from each
lower processing unit through a status signal line.

8. A parallel computer having a hierarchy structure according
to claim 4, wherein each intermediate processing unit and the
upper processing unit receive information regarding the
completion of the second subtask and the first subtask from each
lower processing unit and each intermediate processing unit
through a status signal line, respectively.

9. A parallel computer having a hierarchy structure according
to claim 3, wherein each lower processing unit comprises a
processor, and a memory and a DMA controller connected to the
processor.

10. A parallel computer having a hierarchy structure according
to claim 4, wherein each intermediate processing unit comprises
a processor, and a memory and a DMA controller connected to the
processor.

11. A parallel computer having a hierarchy structure according
to claim 9, wherein the processor and the DMA controller are
connected in a coprocessor connection.

12. A parallel computer having a hierarchy structure according
to claim 10, wherein the processor and the DMA controller are
connected in a coprocessor connection.

13. A parallel computer having a hierarchy structure according
to claim 3, wherein the upper processing unit compresses the
data to be required for executing the subtasks, and then transfers
the compressed data to the corresponding lower processing units.

14. A parallel computer having a hierarchy structure according
to claim 4, wherein the upper processing unit compresses the
data to be required for executing the first subtasks, and then
transfers the compressed data to the corresponding intermediate
processing units.

15. A parallel computer having a hierarchy structure according
to claim 5, wherein each intermediate processing unit compresses
the data to be required for executing the second subtasks, and
then transfers the compressed data to the corresponding lower
processing units.

16. A parallel computer having a hierarchy structure according
to claim 4, wherein each intermediate processing unit is a DMA
transfer processing unit.

17. A parallel computer having a hierarchy structure according
to claim 16, wherein the DMA transfer processing unit is
programmable.

18. A parallel computer having a hierarchy structure according
to claim 1, wherein each lower processing unit is mounted with
the upper processing unit as a multi-chip module on a board.

19. A parallel computer having a hierarchy structure according
to claim 2, wherein each intermediate processing unit and the
corresponding lower processing units are mounted with the upper
processing unit as a multi-chip module on a board.

20. A parallel computer having a hierarchy structure according
to claim 1, wherein each of the upper processing unit and the
lower processing units is formed on an independent semiconductor
chip, and each semiconductor chip is mounted as a single
multi-chip module.

21. A parallel computer having a hierarchy structure according
to claim 2, wherein each of the intermediate processing units,
the corresponding lower processing units, and the upper
processing unit is formed on an independent semiconductor chip,
and each semiconductor chip is mounted as a single multi-chip
module.

22. A parallel computer having a hierarchy structure according
to claim 1, wherein a structure of the connection line is a common
bus connection.

23. A parallel computer having a hierarchy structure according
to claim 2, wherein a structure of each of the first connection
line and the second connection line is a common bus connection.

24. A parallel computer having a hierarchy structure according
to claim 1, wherein a structure of the connection line is a
cross-bus connection.

25. A parallel computer having a hierarchy structure according
to claim 2, wherein a structure of each of the first connection
line and the second connection line is a cross-bus connection.

26. A parallel computer having a hierarchy structure according
to claim 1, wherein a structure of the connection line is a star
connection.

27. A parallel computer having a hierarchy structure according
to claim 2, wherein a structure of each of the first connection
line and the second connection line is a star connection.

ABSTRACT OF THE DISCLOSURE

In a multiprocessor system of a hierarchy configuration
as a parallel computer of a common-bus structure, a processing
unit (120) in an intermediate stage has a processor (123) having
a programmable function that is equal to a normal processor,
an instruction memory (125), and a data memory (127). The
processing unit (120) receives a status signal from a lower
processor (143), and a DMA controller (151) having a memory for
the transfer of large sized data performs compression,
decompression, programmable load dispersion, and load
dispersion according to the state of operation of each lower
processor.